# Use vLLM as backend

Lighteval allows you to use `vllm` as a backend, providing significant speedups.
To use it, simply set `model_args` to the arguments you want to pass to vLLM.

```bash
lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0|0"
```

`vllm` can distribute the model across multiple GPUs using data
parallelism, pipeline parallelism, or tensor parallelism.
You can choose the parallelism method by setting it in the `model_args`.

For example, if you have 4 GPUs, you can split the model across them using `tensor_parallel_size`:

```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"
```

Or, if your model fits on a single GPU, you can use `data_parallel_size` to speed up the evaluation:

```bash
lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"
```
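
Pipeline parallelism can be selected in the same way. The sketch below assumes your installed versions of lighteval and vLLM forward vLLM's `pipeline_parallel_size` argument; check your `VLLMModelConfig` before relying on it:

```bash
# Assumes pipeline_parallel_size is exposed by your lighteval version
lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,pipeline_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"
```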

Available arguments for `vllm` can be found in the `VLLMModelConfig`:

- **pretrained** (str): HuggingFace Hub model ID name or the path to a pre-trained model to load.
- **gpu_memory_utilisation** (float): The fraction of GPU memory to use.
- **revision** (str): The revision of the model.
- **dtype** (str, None): The data type to use for the model.
- **tensor_parallel_size** (int): The number of tensor parallel units to use.
- **data_parallel_size** (int): The number of data parallel units to use.
- **max_model_length** (int): The maximum context length (in tokens) to use for the model.
- **swap_space** (int): The CPU swap space size (GiB) per GPU.
- **seed** (int): The seed to use for the model.
- **trust_remote_code** (bool): Whether to trust remote code during model loading.
- **add_special_tokens** (bool): Whether to add special tokens to the input sequences.
- **multichoice_continuations_start_space** (bool): Whether to add a space at the start of each continuation in multichoice generation.
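
As an illustration, several of these arguments can be combined in a single `model_args` string. The values below are placeholders, not recommendations:

```bash
# Combining multiple VLLMModelConfig arguments in one model_args string
lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,revision=main,seed=42,trust_remote_code=True" \
    "leaderboard|truthfulqa:mc|0|0"
```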

> [!WARNING]
> If you run into OOM issues, you might need to reduce the model's context size
> (`max_model_length`) as well as the `gpu_memory_utilisation` parameter.
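
For example, a more conservative launch might look like the sketch below; the values are illustrative and should be adjusted to your hardware:

```bash
# Lower context length and GPU memory usage to avoid OOM errors
lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,max_model_length=2048,gpu_memory_utilisation=0.5" \
    "leaderboard|truthfulqa:mc|0|0"
```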
